FIFA 20 Dataset Analysis

Table of contents

Our Aim

Introduction

Questions:

We set out to analyze the following ideas:

Load the packages

Data Wrangling

Show the dimensions of the dataset

Descriptive statistics for the data

View the first 5 and last 5 rows of the dataset

Attributes of the dataset

Check for duplicated rows

What is the structure of your dataset?

What is/are the main feature(s) of interest in your dataset?

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Data Preprocessing

Data Cleaning

Define

Code :

Test :

Define

Code :

Check for missing values
Plotting the percentage of missing values

This pie chart shows the columns with more than 10% missing values. The columns related to player tags, player traits, and goalkeeper attributes have the most missing values.
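As a rough illustration of how the percentage of missing values per column could be computed, here is a dependency-free sketch. The helper `missing_percentage`, the sample rows, and the column names `player_tags` and `gk_diving` are assumptions made for this example; the notebook's own code is not shown in the source and most likely relies on a data-frame library instead.

```python
def missing_percentage(rows, columns):
    """Return {column: percentage of rows whose value is None}."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Hypothetical sample rows; None marks a missing value.
players = [
    {"short_name": "A", "player_tags": None, "gk_diving": None},
    {"short_name": "B", "player_tags": "#Speedster", "gk_diving": None},
    {"short_name": "C", "player_tags": None, "gk_diving": 80},
    {"short_name": "D", "player_tags": None, "gk_diving": None},
]

pct = missing_percentage(players, ["short_name", "player_tags", "gk_diving"])

# Keep only the columns above the 10% threshold mentioned in the text.
high_missing = {col: p for col, p in pct.items() if p > 10.0}
print(high_missing)
```

Columns that pass the threshold are the ones that would appear in the pie chart.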

Test :

Add New Feature

Body Mass Index :

Use weight and height to calculate the Body Mass Index (BMI) of every player for use in our analysis.

Code :
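The standard BMI formula is weight in kilograms divided by the square of height in metres. A minimal sketch follows, assuming height is stored in centimetres and weight in kilograms; the function name `body_mass_index` and the sample values are illustrative and not taken from the notebook.

```python
def body_mass_index(weight_kg, height_cm):
    """BMI = weight (kg) / height (m) squared."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0  # convert centimetres to metres
    return weight_kg / height_m ** 2


# Hypothetical player: 72 kg and 170 cm.
bmi = round(body_mass_index(72, 170), 1)
print(bmi)
```

Applying the same function to every row gives the new BMI feature.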

Data Exploration

Now compute statistics and create visualizations for the questions:

Questions

Univariate Exploration

Question 1 :

Distribution of some numerical attributes.

Code :

We can examine the distributions further with boxplots.

Code :

Question 2 :

Count of players by age.

Code :
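Counting players by age is a frequency count over one column. The sketch below uses `collections.Counter` on a made-up list of ages; the values are purely illustrative and do not come from the dataset.

```python
from collections import Counter

# Hypothetical ages standing in for the dataset's age column.
ages = [22, 25, 22, 31, 19, 22, 25]

age_counts = Counter(ages)

# The most frequent age and how many players have it.
most_common_age, n_players = age_counts.most_common(1)[0]

# Sorted (age, count) pairs are what a bar graph would plot.
for age, count in sorted(age_counts.items()):
    print(age, count)
```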

This bar graph counts players by age and shows that age 22 has the most players.

Question 3 :

Count of players by nationality.

Code :

Question 4 :

Count of players by preferred foot.

Code :

The pie chart of preferred foot shows that right-footed players make up a larger share of the FIFA 20 dataset than left-footed players.

Question 5 :

Count of players by BMI (Body Mass Index).

Code :

The bar graph of player counts by body mass index (BMI) shows that the most common BMI is 22, and that most players fall in the normal-weight range.

Question 6 :

Players with the maximum values of selected features (overall, age, potential, value_eur, wage_eur, etc.).

Code :

Question 7 :

Clubs of the top 20 players by overall rating.

Code :

The pie chart of clubs for the top 20 players shows that FC Barcelona has more top players than any other club, followed by Real Madrid and Liverpool at 15%.

Question 8 :

Nationalities of the top 10 players by overall rating.

Code :

The pie chart of nationalities for the top 20 players shows that Spain has more top players than any other country.

Question 9 :

Most common age among the top 20 players.

Code :

The bar graph of the top 20 players by age shows that the most common age among them is 28.

Question 10 :

Proportion of players per position
Percentage of players in Defender Role

Code :

The pie chart of team positions in the defender role shows that RCB and LCB have the most players at 24.8%, followed by LB and RB at 21.1%.

Percentage of players in Midfielder Role

Code :

The pie chart of team positions in the midfielder role shows that LCM and RCM have the most players at 15.1%, followed by RM at 14.7%.

Percentage of players in Attacker Role

Code :

The pie chart of team positions in the attacker role shows that ST has the most players at 37.4%, followed by LS and RS at 15.9%.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Bivariate Exploration

Question 11 :

Does overall rating increase with age?

Code :

Scatter plot of age against overall rating.

Question 12 :

Does overall rating increase with body mass index (BMI)?

Code :

Scatter plot of BMI against overall rating.

Question 13 :

Relation between height and weight.

Code :

The scatter plot of height against weight shows that the two are positively and roughly linearly related.

Question 14 :

Relation between weight and pace.

Code :

The scatter plot of pace against weight shows that pace tends to decrease as weight increases.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Multivariate Exploration

Question 15 :

Relation between overall rating, value in euros, and age.

Code :

Question 16 :

Relation between overall rating, potential, and age.

Code :

Overall rating and potential are approximately linearly related, and the younger players have the highest potential.

Question 17 :

Correlation between features.

Code :
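The correlation between two numeric features is commonly measured with the Pearson coefficient. The sketch below implements it directly so the arithmetic is visible; the helper name `pearson` and the sample values are assumptions for illustration, and the notebook itself may well use a library routine and a heatmap instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for two features of five players.
overall = [60, 65, 70, 75, 80]
potential = [70, 72, 76, 79, 83]
print(round(pearson(overall, potential), 3))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship.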

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Question 18 :

Comparison between Messi and Ronaldo.

Code :

Question 19 :

Comparison between Messi and M. Salah.

Code :

Question 20 :

Comparison between Ronaldo and M. Salah.

Hypothesis Test

Players have one of two preferred feet (left or right), and we would like to test whether the preferred foot has an impact on overall rating and on attacking finishing.
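One way to compare the two groups is a two-sample test on the difference in means. The sketch below computes Welch's statistic and, as a simplifying assumption, takes the p-value from the normal distribution, which is only reasonable when both groups are large. The function name `welch_normal_test` and the sample ratings are invented for illustration; the test actually used in the notebook is not shown in the source.

```python
import math
from statistics import NormalDist, mean, variance


def welch_normal_test(a, b):
    """Two-sided test for a difference in means (unequal variances).

    Returns (statistic, p_value). The p-value uses a normal approximation,
    so it should only be trusted for large samples.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    stat = (mean(a) - mean(b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value


# Hypothetical overall ratings for the two groups.
left_foot = [70, 72, 74, 76, 78]
right_foot = [60, 62, 64, 66, 68]

stat, p_value = welch_normal_test(left_foot, right_foot)
# Reject the null hypothesis of equal means at the 5% level if p < 0.05.
print(stat, p_value < 0.05)
```

The same function can be applied to attacking finishing by passing that column's values for each group.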

Overall Rating

Code :

Attacking Finishing

Code :

Models for predicting players' Premier League clubs in FIFA 20

First, select the important features to help with the prediction.

Code :

Then select the clubs that play in the Premier League.

Code :

Replace the club names with their rankings in the Premier League.

Code :

See the count of clubs in the Premier League.

Code :

Check the change

Code :

Overall vs Premier League Clubs

Code :

Wage vs Players in Premier League Clubs

Code :

Plot Players in Premier League Clubs by Potential and Value

Code :

Classifying with Decision Tree

X: all features except the club; y: the club

Code :

80/20 train & test split

Code :
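An 80/20 split shuffles the samples and holds out one fifth of them for testing. The sketch below shows the idea with the standard library only; `split_train_test` is a hypothetical helper written for this example, and the notebook most likely uses a library function for the same step.

```python
import random


def split_train_test(X, y, test_fraction=0.2, seed=42):
    """Shuffle the indices and split features and labels into train and test parts."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Hypothetical data: 10 samples with one feature each.
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))
```

Fixing the seed keeps the split identical between runs, which makes the accuracy figures of different models comparable.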

Standardize data

Code :

Scale both the train and the test set

Code :
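When standardizing, the mean and standard deviation are usually estimated on the training set only and then applied to both sets, so that no information from the test set leaks into training. The sketch below illustrates this; `fit_scaler` and `transform` are hypothetical helpers, not the notebook's code.

```python
import math


def fit_scaler(rows):
    """Return per-column means and population standard deviations."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((row[j] - means[j]) ** 2 for row in rows) / n
        # Constant columns get a divisor of 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def transform(rows, means, stds):
    """Standardize each column: (value - mean) / std."""
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical feature rows.
X_train = [[1.0, 10.0], [3.0, 30.0]]
X_test = [[2.0, 20.0]]

means, stds = fit_scaler(X_train)           # fit on the training set only
X_train_scaled = transform(X_train, means, stds)
X_test_scaled = transform(X_test, means, stds)  # reuse the training statistics
print(X_train_scaled, X_test_scaled)
```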

Apply Decision Tree

Code :

Initialize the decision tree classifier

Code :

Train the model

Code :

Get prediction

Code :

Confusion matrix

Code :

Test set accuracy

Code :
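The confusion matrix and the test set accuracy are both derived from the true and predicted labels. The sketch below shows how they relate; the helpers and the example labels are illustrative assumptions, with rows standing for true classes and columns for predicted classes.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Hypothetical club rankings used as class labels.
y_true = [1, 1, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3]

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
acc = accuracy(y_true, y_pred)
print(cm, acc)
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the total number of samples.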

K-Nearest Neighbors (KNN) Model

Divide the data into training and test sets

Code :

Import the library and scale the data

Code :

Apply KNN

Code :
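K-nearest neighbors is a classification method: a new sample receives the majority label of its k closest training samples. The sketch below is a minimal version using Euclidean distance; `knn_predict` and the toy data are assumptions for illustration, and the notebook presumably uses a library classifier instead.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training sample's distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical scaled features with two classes of club ranking.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [1, 1, 1, 2, 2, 2]

print(knn_predict(train_X, train_y, [0.5, 0.5]))
print(knn_predict(train_X, train_y, [5.5, 5.5]))
```

Because the method depends on distances, scaling the features beforehand matters more here than for tree-based models.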

Train the model

Code :

Get prediction

Code :

Confusion matrix

Code :

Test set accuracy

Code :

Support Vector Machine (SVM) Model

Divide the data into training and test sets

Code :

Import the library and scale the data

Code :

Apply SVM

Code :

Train the model

Code :

Get prediction

Code :

Confusion Matrix

Code :

Test set accuracy

Code :

Random Forest Classifier (RFC) Model

Divide the data into training and test sets

Code :

Import the library and scale the data

Code :

Apply RFC

Code :

Train the model

Code :

Get prediction

Code :

Confusion Matrix

Code :

Test set accuracy

Code :

Conclusion

After exploration, we found a high correlation between:

The result of exploration

The result of hypothesis test

The result of models

Limitations of data